ABSTRACT

We present a novel approach to the computational modeling of social
interactions, based on modeling essential social interaction predicates
(ESIPs) such as joint attention and entrainment. Based on sound social
psychological theory and methodology, we collect a new “Tower Game”
dataset consisting of audio-visual capture of dyadic interactions
labeled with the ESIPs. We expect this dataset to provide a new avenue
for research in computational social interaction modeling. We propose a
novel joint Discriminative Conditional Restricted Boltzmann Machine
(DCRBM) model that combines a discriminative component with the
generative power of CRBMs. Such a combination enables us to uncover
actionable constituents of the ESIPs in two steps. First, we train the
DCRBM model on the labeled data and obtain accurate detection of the
predicates (49%-76% accuracy, depending on the ESIP). Second, we exploit
the generative capability of DCRBMs to activate the trained model so as
to generate the lower-level data corresponding to a specific ESIP; the
generated data closely matches the actual training data (mean square
error of 0.01-0.1 when generating 100 frames). We are thus able to
decompose the ESIPs into their constituent actionable behaviors. Such a
purely computational determination of how to establish an ESIP such as
engagement is unprecedented.

Categories and Subject Descriptors 

H.1.2 [Models and Principles]: User/Machine Systems— 
Human information processing; I.2.10 [Artificial Intelligence]:
Vision and Scene Understanding

General Terms 

Algorithms, Theory, Human Factors

1. INTRODUCTION

This research brings together multiple disciplines to explore the
problem of social interaction modeling. The goal of this work is to
leverage research in social psychology, computer vision, signal
processing, and machine learning to better understand human social
interactions.

As an example application, consider aid workers or medical personnel
deployed in a foreign country. During the course of their deployment,
these workers often have to interact with people with whom they share
little in terms of language, customs and culture. Reducing friction as
well as increasing engagement between the workers and the populations
they encounter can have an important bearing on the success of their
mission. The ability to impart to such professionals a general
cross-cultural competency, one that enables them to interact smoothly
with the foreign populations they encounter, would therefore be
extremely useful. With such an application in mind, we focus on
identifying and automatically detecting predicates that facilitate
social interactions irrespective of the cultural context. Since our
interests lie in aspects of social interactions that reduce conflict and
build trust, we focus on social predicates that support rapport: joint
attention, temporal synchrony, mimicry, and coordination.

Our orientation to social sensing departs significantly from existing
methods that focus on inferring internal or hidden mental states.
Instead, inspired by a growing body of research, we focus on the process
of social interaction. This research argues that social interaction is
more than the meeting of two minds, with an additional emphasis on the
cognitive, perceptual and motor explanations of the joint and
coordinated actions that occur as part of these interactions. Our
approach is guided by two key insights. The first is that, apart from
inferring the mental state of the other, social interactions require
individuals to attend to each other’s movements, utterances and context
in order to coordinate actions jointly. The second insight is that
social interactions involve reciprocal acts and joint behaviors, along
with nested events (e.g. speech, eye gaze, gestures) at various
timescales, and therefore demand adaptive and cooperative behaviors of
their participants.

Figure 1: (a) Our capture setup, which includes a GoPro camera mounted on each participant’s chest and a Kinect mounted
on a tripod. (b) An overhead view of our capture setup involving the two participants. (c) Sample data collected: the image
outlined in solid red shows the image captured from the GoPro camera mounted on player A (green shirt), while the image
outlined in dashed red shows the image captured from the Kinect behind player A and is used to track the upper body of
player B (red shirt). Similarly, the image outlined in solid green is the image captured from the GoPro mounted on player B
and the image outlined in dashed green is the image captured from the Kinect behind player B. (d) A view of our collected
data projected in a unified coordinate framework.

Starting from prior work that emphasizes the interactive and cooperative
aspects of social interactions, we focus on detecting rhythmic coupling
(also known as entrainment and attunement), mimicry (behavioral
matching), movement simultaneity, kinematic turn-taking patterns, and
other measurable features of engaged social interaction. We established
that behaviors such as joint attention and entrainment are the essential
predicates of social interaction (ESIPs). With this in mind, we focus on
developing computational models of social interaction that use
multimodal sensing and temporal deep learning models to detect and
recognize these ESIPs, as well as to discover their actionable
constituents.

Over the past decade, the fields of computer vision and machine learning
have made significant advances. Furthermore, with the availability of
complex sensors like Kinect, researchers are able to accurately track
full human body poses. This has enabled many different applications,
such as activity recognition, facial feature tracking, and multimodal
event detection [22].

The complexity of our problem requires a machine learning algorithm
capable of jointly recognizing classes, correlating features, and
generating multimodal data of dyadic social interactions. Discriminative
models focus on maximizing the separation between classes; however, they
are often uninterpretable. Generative models, on the other hand, focus
solely on modeling distributions and are often unable to incorporate
higher-level knowledge. Hybrid models address these problems by
combining the advantages of discriminative and generative models: they
encode higher-level knowledge and also model the distribution from a
discriminative perspective. We propose a novel hybrid model that allows
us to recognize classes, correlate features, and generate social
interaction data.

This paper proposes a new approach to machine learning that answers
questions posed by social psychology. Our approach to social sensing is
multimodal and attempts to detect the existence of features of social
interaction, social interaction itself, and the qualitative and dynamic
features of social interaction. We took a multimodal approach because
humans must solve a variety of binding problems to effectively
coordinate action. Coordination must span everything from postural sway,
eye gaze, and head pose to gestures, lexical choice, and verbal pitch
and intonation.

Our contributions are three-fold:

• A new problem of computational modeling of essential
social interaction predicates (ESIPs). Starting from
a socio-psychological framework, we demonstrate the
use of multimodal sensors and temporal deep learning
models to uncover actionable constituents of ESIPs.

• A new dataset, the Tower Game Dataset, for analyzing
social interaction predicates. The dataset consists of
multimodal recordings of two players participating in
a tower-building game, in the process communicating
and collaborating with each other. The dataset has
been annotated with ESIPs and will be made publicly
available. We believe that it will foster new research in
the area of computational social interaction modeling.

• A novel model, the Discriminative Conditional Restricted
Boltzmann Machine (DCRBM), that introduces a
discriminative component to Conditional Restricted
Boltzmann Machines (CRBMs). The discriminative component
enables DCRBMs to directly learn classification
models while retaining all the advantages of CRBMs,
including their ability to generate missing data.
Results on the Tower Game Dataset demonstrate that
DCRBMs can effectively detect ESIPs as well as decompose
ESIPs into their constituent actionable behaviors.

Paper organization: In Sec. 2 we discuss prior work. In Sec. 3 we
specify our model and then explain inference and learning. In Sec. 4 we
describe our dataset and demonstrate the quantitative results of our
approach. In the final section we conclude.


2. RELATED WORK 

Social Psychology: The study of social interactions and their associated
sociological and psychological implications has received a lot of
attention from social science researchers [26]. Early research focused
on the “Theory of Mind”, according to which individuals ascribe mental
states to themselves and others, a line of thinking that inspired much
of the initial work on affective computing. However, more recent work
has shown that, apart from inferring each other’s mental states, an
important challenge for participants in a social interaction is to
pragmatically sustain sequences of action in which the actions are
tightly coupled to one another via multiple channels of observable
information (e.g. visible kinematic information, audible speech). In
other words, social interactions require dynamically coupled
interpersonal motor coordination from their participants [46]. Moreover,
detecting coupled behaviors such as kinematic turn taking or
simultaneity in movements can help in recognizing engaged social
interactions [10].

Affective Computing: refers to the study and development of systems that
can automatically detect human affect [8, 39]. Affective computing has
long been an active research area due to its utility in a variety of
applications that require realistic Human Computer Interaction, such as
online tutoring and health screenings. The goal here is to detect the
overall mental or emotional state of the person based on external cues.
This is typically done based on speech, facial expressions,
gesture/posture, and multimodal cues. There has also been work on
modeling activities and interactions involving multiple people [44].
However, most of this work deals with short-duration, task-oriented
activities [23, 44] with a focus on their physical aspects. There has
been recent interest in modeling interactions with a focus on the rich
and complex social behaviors that they elicit, along with their
affective impact on the participants [43].

Hybrid Models: consist of a generative model, which usually learns a
feature representation of low-level input, and a discriminative model
for higher-level reasoning. Recent work has empirically shown that
generative models which learn a rich feature representation tend to
outperform discriminative models that rely solely on hand-crafted
features [38]. Hybrid models can be divided into three groups: joint
methods [24, 32], iterative methods [49], and staged methods [21, 28].
Joint methods optimize a single objective function which consists of
both the generative and discriminative energies. Iterative methods
consist of a generative and a discriminative model that are trained in
an iterative manner, influencing each other. In staged methods, both
models are trained separately, with the discriminative model being
trained on representations learned by the generative model.
Classification is performed after projecting the samples into a
fixed-dimensional space induced by the generative model.

Deep Networks: are able to learn rich features in an unsupervised
manner, which is what makes deep learning so powerful. They have been
successfully applied to many problems. Restricted Boltzmann Machines
(RBMs) form the building blocks of deep network models [45]. These
networks are trained using the Contrastive Divergence (CD) algorithm,
which has demonstrated the ability of deep networks to capture the
distributions over the features efficiently and to learn complex
representations. RBMs can be stacked together to form deeper networks
known as Deep Boltzmann Machines (DBMs), which capture more complex
representations. Recently, temporal models based on deep networks,
capable of modeling a more temporally rich set of problems, have been
proposed. These include Conditional RBMs (CRBMs) and Temporal RBMs
(TRBMs) [18, 51]. CRBMs have been successfully used in both the visual
and audio domains: they have been used for modeling human motion,
tracking 3D human pose, and phone recognition. TRBMs have been applied
to transferring 2D and 3D point clouds, transition-based dependency
parsing, and polyphonic music generation [27].


3. APPROACH 

In this section we describe our approach. We first review 
similar prior work, next we define our model, formulate its 
inference, and finally show how the model parameters are 
learned. 

3.1 Review of Prior Models 

Restricted Boltzmann Machines: An RBM (Fig. 2a) defines a probability
distribution $p_R$ as a Gibbs distribution (1), where $\mathbf{v}$ is a
vector of visible nodes, $\mathbf{h}$ is a vector of hidden nodes, $E_R$
is the energy function and $Z$ is the partition function, which ensures
that the distribution is valid. The parameters $\theta_R$ to be learned
are $\mathbf{a}$ and $\mathbf{b}$, the biases for $\mathbf{v}$ and
$\mathbf{h}$ respectively, and the weights $W$. The RBM architecture is
fully connected between layers, with no lateral connections. This
architecture implies that $\mathbf{v}$ and $\mathbf{h}$ are each
factorial given the other, which allows for the exact computation of
$p(\mathbf{v} \mid \mathbf{h})$ and $p(\mathbf{h} \mid \mathbf{v})$.

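$$
\begin{aligned}
p_R(\mathbf{v}, \mathbf{h}; \theta_R) &= \exp[-E_R(\mathbf{v}, \mathbf{h})] / Z(\theta_R),\\
Z(\theta_R) &= \sum_{\mathbf{v},\mathbf{h}} \exp[-E_R(\mathbf{v}, \mathbf{h})],\\
\theta_R &= \{\mathbf{a}, \mathbf{b}, W\}.
\end{aligned}
\tag{1}
$$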

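In the case of binary-valued data, $v_i$ is defined as a logistic
function. In the case of real-valued data, $v_i$ is defined as a
multivariate Gaussian distribution with unit covariance. A binary-valued
hidden layer $h_j$ is defined as a logistic function^1. This is done
because we want the hidden layer to be a sparse binary code (empirically
shown to work better [52, 54]). The probability distributions for
$\mathbf{v}$ and $\mathbf{h}$ are shown in (2).

$$
\begin{aligned}
p(v_i = 1 \mid \mathbf{h}) &= \sigma\Big(a_i + \sum_j h_j w_{ij}\Big), && \text{Binary},\\
p(v_i \mid \mathbf{h}) &= \mathcal{N}\Big(a_i + \sum_j h_j w_{ij},\ 1\Big), && \text{Real},\\
p(h_j = 1 \mid \mathbf{v}) &= \sigma\Big(b_j + \sum_i v_i w_{ij}\Big), && \text{Binary}.
\end{aligned}
\tag{2}
$$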

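The energy function $E_R$ for binary $v_i$ is defined as in (3),

$$
E_R(\mathbf{v}, \mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j,
\tag{3}
$$

while the energy function $E_R$ is slightly modified to allow for
real-valued $\mathbf{v}$, as shown in (4):

$$
E_R(\mathbf{v}, \mathbf{h}) = \sum_i \frac{(a_i - v_i)^2}{2} - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j.
\tag{4}
$$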
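As an illustration of the real-valued RBM reviewed above, the following is a minimal sketch of the conditionals in (2) and the energy in (4), written in plain Python. The function names and the list-of-lists layout of `W` (indexed `W[i][j]`) are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def sigmoid(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, b, W):
    """p(h_j = 1 | v) = sigma(b_j + sum_i v_i * w_ij), as in (2)."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_means(h, a, W):
    """Mean of p(v_i | h) = N(a_i + sum_j h_j * w_ij, 1) for real-valued v."""
    return [a[i] + sum(h[j] * W[i][j] for j in range(len(h)))
            for i in range(len(a))]


def energy_real(v, h, a, b, W):
    """Energy (4): sum_i (a_i - v_i)^2 / 2 - sum_j b_j h_j - sum_ij v_i w_ij h_j."""
    quad = sum((a[i] - v[i]) ** 2 for i in range(len(v))) / 2.0
    bias = sum(b[j] * h[j] for j in range(len(h)))
    inter = sum(v[i] * W[i][j] * h[j]
                for i in range(len(v)) for j in range(len(h)))
    return quad - bias - inter


def gibbs_step(v, a, b, W, rng):
    """One block Gibbs step v -> h -> v', sampling each layer in parallel."""
    h = [1 if rng.random() < p else 0 for p in hidden_probs(v, b, W)]
    v_new = [rng.gauss(mu, 1.0) for mu in visible_means(h, a, W)]
    return h, v_new
```

Because the layers are factorial given each other, each conditional is a simple per-unit computation; this is what makes the block Gibbs step above exact for one layer at a time.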

Discriminative Restricted Boltzmann Machines: DRBMs are a natural
extension of RBMs which have an additional discriminative term for
classification. They build on a previously proposed model. A DRBM
(Fig. 2b) defines a probability distribution $p_{DR}$ as a Gibbs
distribution (5).

$$
\begin{aligned}
p_{DR}(\mathbf{y}, \mathbf{v}, \mathbf{h}; \theta_{DR}) &= \exp[-E_{DR}(\mathbf{y}, \mathbf{v}, \mathbf{h})] / Z(\theta_{DR}),\\
Z(\theta_{DR}) &= \sum_{\mathbf{y},\mathbf{v},\mathbf{h}} \exp[-E_{DR}(\mathbf{y}, \mathbf{v}, \mathbf{h})],\\
\theta_{DR} &= \{\mathbf{a}, \mathbf{b}, \mathbf{s}, W, U\}.
\end{aligned}
\tag{5}
$$

^1 The logistic function $\sigma(\cdot)$ for a variable $x$ is defined as
$\sigma(x) = (1 + \exp(-x))^{-1}$.

Figure 2 (d): DCRBMs are discriminatively trained hybrid models.

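The probability distribution over the visible layer follows the same
distributions as in (2). The hidden layer $\mathbf{h}$ is defined as a
function of the labels $\mathbf{y}$ and the visible nodes $\mathbf{v}$.
Also, a new probability distribution for the classifier is defined to
relate the label $\mathbf{y}$ to the hidden nodes $\mathbf{h}$, as in (6).

$$
\begin{aligned}
p(v_i \mid \mathbf{h}) &= \mathcal{N}\Big(a_i + \sum_j h_j w_{ij},\ 1\Big),\\
p(h_j = 1 \mid y_k, \mathbf{v}) &= \sigma\Big(b_j + u_{j,k} + \sum_i v_i w_{ij}\Big),\\
p(y_k \mid \mathbf{h}) &= \frac{\exp\big[s_k + \sum_j u_{j,k} h_j\big]}{\sum_{k^*} \exp\big[s_{k^*} + \sum_j u_{j,k^*} h_j\big]}.
\end{aligned}
\tag{6}
$$

The new energy function $E_{DR}$ is defined similarly to (4):

$$
E_{DR}(\mathbf{y}, \mathbf{v}, \mathbf{h}) = \sum_i \frac{(a_i - v_i)^2}{2} - \sum_j b_j h_j - \sum_k s_k y_k - \sum_{i,j} v_i w_{ij} h_j - \sum_{j,k} h_j u_{j,k} y_k.
\tag{7}
$$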
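The label-related conditionals of the DRBM in (6) can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names and the indexing convention `U[j][k]` (hidden unit `j`, label `k`) are assumptions for the example, and the softmax is computed with the usual max-subtraction for numerical stability.

```python
import math


def label_posterior_given_hidden(h, s, U):
    """p(y_k | h) = softmax_k(s_k + sum_j u_jk * h_j), as in (6)."""
    scores = [s[k] + sum(U[j][k] * h[j] for j in range(len(h)))
              for k in range(len(s))]
    top = max(scores)  # subtract the max before exponentiating
    exps = [math.exp(x - top) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]


def hidden_probs_given_label(v, k, b, W, U):
    """p(h_j = 1 | y_k, v) = sigma(b_j + u_jk + sum_i v_i * w_ij), as in (6)."""
    probs = []
    for j in range(len(b)):
        x = b[j] + U[j][k] + sum(v[i] * W[i][j] for i in range(len(v)))
        probs.append(1.0 / (1.0 + math.exp(-x)))
    return probs
```

Compared with the plain RBM, the only change on the hidden side is the extra label-dependent bias `U[j][k]`, which is how the class label shifts the hidden activations.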


Conditional Restricted Boltzmann Machines: CRBMs are a natural extension
of RBMs for modeling short-term temporal dependencies. A CRBM (Fig. 2c)
is an RBM which, at time $t$, takes into account history from the
previous time instances $[(t-N), \ldots, (t-1)]$. This is done by
treating the previous time instances as additional inputs. Doing so does
not complicate inference^2. A CRBM defines a probability distribution
$p_C$ as a Gibbs distribution (8).

$$
\begin{aligned}
p_C(\mathbf{v}_t, \mathbf{h}_t \mid \mathbf{v}_{<t}; \theta_C) &= \exp[-E_C(\mathbf{v}_t, \mathbf{h}_t \mid \mathbf{v}_{<t})] / Z(\theta_C),\\
Z(\theta_C) &= \sum_{\mathbf{v},\mathbf{h}} \exp[-E_C(\mathbf{v}_t, \mathbf{h}_t \mid \mathbf{v}_{<t})],\\
\theta_C &= \{\mathbf{a}, \mathbf{b}, A, B, W\}.
\end{aligned}
\tag{8}
$$

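The additional inputs from previous time instances are modeled as
directed autoregressive edges from the past $N$ visible frames to the
current visible layer and from the past $M$ visible frames to the
current hidden layer, where $N$ does not have to be equal to $M$. The
concatenated history vector is denoted $\mathbf{v}_{<t}$. The
probability distributions are defined in (9).

$$
\begin{aligned}
p(v_i \mid \mathbf{h}, \mathbf{v}_{<t}) &= \mathcal{N}\Big(a_i + \sum_n A_{n,i} v_{n,<t} + \sum_j h_j w_{ij},\ 1\Big),\\
p(h_j = 1 \mid \mathbf{v}, \mathbf{v}_{<t}) &= \sigma\Big(b_j + \sum_m B_{m,j} v_{m,<t} + \sum_i v_i w_{ij}\Big).
\end{aligned}
\tag{9}
$$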

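The new energy function $E_C(\mathbf{v}_t, \mathbf{h}_t \mid \mathbf{v}_{<t})$
in (10) is defined in a manner similar to that of the RBM (4):

$$
E_C(\mathbf{v}_t, \mathbf{h}_t \mid \mathbf{v}_{<t}) = \sum_i \frac{(c_i - v_{i,t})^2}{2} - \sum_j d_j h_{j,t} - \sum_{i,j} v_{i,t} w_{ij} h_{j,t},
\tag{10}
$$

where

$$
c_i = a_i + \sum_n A_{n,i} v_{n,<t}, \qquad d_j = b_j + \sum_m B_{m,j} v_{m,<t}.
$$

^2 Some approximations have been made to facilitate efficient training
and inference; more details are available in [54].

Note that A and B are matrices of concatenated vectors of
previous time instances of a and b.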
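The dynamic biases $c_i$ and $d_j$ that distinguish a CRBM from an RBM can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the same concatenated history vector is used for both the visible and the hidden biases (that is, $N = M$), and `A[n][i]` and `B[m][j]` are indexed by history element first.

```python
def concat_history(frames, n):
    """Concatenate the last n frames (oldest first) into one history vector v_<t."""
    if len(frames) < n:
        raise ValueError("need at least n past frames")
    return [x for frame in frames[-n:] for x in frame]


def dynamic_biases(history, a, b, A, B):
    """Return (c, d) with c_i = a_i + sum_n A[n][i] * history[n]
    and d_j = b_j + sum_m B[m][j] * history[m]."""
    c = [a[i] + sum(A[n][i] * history[n] for n in range(len(history)))
         for i in range(len(a))]
    d = [b[j] + sum(B[m][j] * history[m] for m in range(len(history)))
         for j in range(len(b))]
    return c, d
```

Once `c` and `d` are computed for the current time step, they simply replace the static biases `a` and `b` in the RBM conditionals, which is why conditioning on history does not complicate inference.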


3.2 Model

Discriminative Conditional Restricted Boltzmann Machines: DCRBMs are a
natural extension of CRBMs which have an additional discriminative term
for classification. They are based on an existing discriminative model,
generalized to account for temporal phenomena using CRBMs. DCRBMs
(Fig. 2d) are a simpler version of Factored Conditional Restricted
Boltzmann Machines and Gated Restricted Boltzmann Machines. Both of
these models incorporate labels when learning representations; however,
they use a more complicated potential in which three-way connections are
decomposed into factors. A DCRBM defines a probability distribution
$p_{DC}$ as a Gibbs distribution (11).

$$
\begin{aligned}
p_{DC}(\mathbf{y}_t, \mathbf{v}_t, \mathbf{h}_t \mid \mathbf{v}_{<t}; \theta_{DC}) &= \exp[-E_{DC}(\mathbf{y}_t, \mathbf{v}_t, \mathbf{h}_t \mid \mathbf{v}_{<t})] / Z(\theta_{DC}),\\
Z(\theta_{DC}) &= \sum_{\mathbf{y}_t,\mathbf{v}_t,\mathbf{h}_t} \exp[-E_{DC}(\mathbf{y}_t, \mathbf{v}_t, \mathbf{h}_t \mid \mathbf{v}_{<t})],\\
\theta_{DC} &= \{\mathbf{a}, \mathbf{b}, \mathbf{s}, A, B, W, U\}.
\end{aligned}
\tag{11}
$$

The probability distribution over the visible layer follows the same
distribution as in the CRBM (9). The hidden layer $\mathbf{h}$ is
defined as a function of the labels $\mathbf{y}$ and the visible nodes
$\mathbf{v}$. A new probability distribution for the classifier is
defined to relate the label $\mathbf{y}$ to the hidden nodes
$\mathbf{h}$, as in (12).

$$
\begin{aligned}
p(v_{i,t} \mid \mathbf{h}_t, \mathbf{v}_{<t}) &= \mathcal{N}\Big(a_i + \sum_n A_{n,i} v_{n,<t} + \sum_j h_{j,t} w_{ij},\ 1\Big),\\
p(h_{j,t} = 1 \mid \mathbf{y}_t, \mathbf{v}_t, \mathbf{v}_{<t}) &= \sigma\Big(b_j + u_{j,k} + \sum_i v_{i,t} w_{ij} + \sum_m B_{m,j} v_{m,<t}\Big),\\
p(y_{k,t} \mid \mathbf{h}_t) &= \frac{\exp\big[s_k + \sum_j u_{j,k} h_{j,t}\big]}{\sum_{k^*} \exp\big[s_{k^*} + \sum_j u_{j,k^*} h_{j,t}\big]}.
\end{aligned}
\tag{12}
$$

The new energy function $E_{DC}$ is defined similarly to that of the
DRBM (7):

$$
E_{DC}(\mathbf{y}_t, \mathbf{v}_t, \mathbf{h}_t \mid \mathbf{v}_{<t}) = \sum_i \frac{(c_i - v_{i,t})^2}{2} - \sum_j d_{j,k} h_{j,t} - \sum_k s_k y_{k,t} - \sum_{i,j} v_{i,t} w_{ij} h_{j,t},
\tag{13}
$$

where $k$ indexes the active label in $\mathbf{y}_t$ and

$$
c_i = a_i + \sum_n A_{n,i} v_{n,<t}, \qquad d_{j,k} = b_j + u_{j,k} + \sum_m B_{m,j} v_{m,<t}.
$$

Note that A and B are matrices of concatenated vectors
of previous time instances of a and b.

3.3 Inference and Learning

Inference: For classification we use a bottom-up approach, where we
maximize the posterior distribution
$p_{DC}(y_{t,k} \mid \mathbf{v}_t, \mathbf{v}_{<t})$ over all the
labels. This is equivalent to activating the hidden layer given the
visible layer $\mathbf{v}_t$, the visible layer history
$\mathbf{v}_{<t}$, and the label $y_{t,k}$, as shown in (14).

$$
\hat{y}_t = \arg\max_k \ p_{DC}(y_{t,k} \mid \mathbf{v}_t, \mathbf{v}_{<t}), \quad \text{where}
$$

$$
p_{DC}(y_{t,k} \mid \mathbf{v}_t, \mathbf{v}_{<t}) = \frac{\exp\big[s_k + \sum_j \log\big(1 + \exp(d_{j,k} + \sum_i v_{i,t} w_{ij})\big)\big]}{\sum_{k^*} \exp\big[s_{k^*} + \sum_j \log\big(1 + \exp(d_{j,k^*} + \sum_i v_{i,t} w_{ij})\big)\big]}.
\tag{14}
$$

For generation we use a combination of top-down and bottom-up
activation, depending on the type of generation, activating the required
layers given the available data. Figs. 4a and 5a show the two cases. The
first case (Fig. 4a) deals with partially missing data, where we have
partial data for the visible layer $\mathbf{v}_t$ as well as the label
$\mathbf{y}$, and our goal is to generate the missing part of
$\mathbf{v}_t$. The second case (Fig. 5a) is when the visible layer
$\mathbf{v}_t$ is fully missing and our goal is to generate it given
only the class label $\mathbf{y}$. For both cases we assume we have
access to some history information.

Learning: Our model is learned using Contrastive Divergence (CD). The
update equations of the dynamically changing biases $\Delta c$ and
$\Delta d$ are obtained by first updating $\Delta A$ and $\Delta B$, as
in the case of the real-valued CRBM, and then combining them with
$\Delta a$ and $\Delta b$.

$$
\begin{aligned}
\Delta w_{i,j} &\propto \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon},\\
\Delta u_{j,k} &\propto \langle y_k h_j \rangle_{data} - \langle y_k h_j \rangle_{recon},\\
\Delta a_i &\propto \langle v_i \rangle_{data} - \langle v_i \rangle_{recon},\\
\Delta b_j &\propto \langle h_j \rangle_{data} - \langle h_j \rangle_{recon},\\
\Delta s_k &\propto \langle y_k \rangle_{data} - \langle y_k \rangle_{recon},\\
\Delta A_{k,i,t-n} &\propto v_{k,t-n} \big( \langle v_{i,t} \rangle_{data} - \langle v_{i,t} \rangle_{recon} \big),\\
\Delta B_{i,j,t-m} &\propto v_{i,t-m} \big( \langle h_{j,t} \rangle_{data} - \langle h_{j,t} \rangle_{recon} \big),
\end{aligned}
\tag{15}
$$

where $\langle \cdot \rangle_{data}$ is the expectation with respect to
the data distribution and $\langle \cdot \rangle_{recon}$ is the
expectation with respect to the reconstructed data. The reconstruction
is generated by first sampling $p(h_j = 1 \mid \mathbf{v}, \mathbf{y})$
for all the hidden nodes in parallel. The visible nodes are then
generated by sampling $p(v_i \mid \mathbf{h})$ for all the visible nodes
in parallel. Finally, the label nodes are generated from
$p(\mathbf{y} \mid \mathbf{h})$ using (12).

4. EXPERIMENTS

In this section, we first discuss existing activity recognition and
affective computing datasets. Next we describe the collection and
annotation of our Tower Game Dataset, which contains recordings of two
players building a tower and in the process engaging in a variety of
interactive behaviors. Finally, we describe our experimental results on
this dataset, demonstrating the effectiveness of our DCRBM model.

4.1 Datasets 

Most existing activity recognition benchmarks (e.g., the Weizmann,
Trecvid, PETS04, CAVIAR, IXMAS, Hollywood, Olympic Sports and UCF-100
datasets) contain relatively simple and repetitive actions involving a
single person. On the other hand, group activity recognition datasets,
such as the UCLA Courtyard, UT-Interactions, Collective Activity, and
Volleyball datasets, lack rich social dynamics.

Other relevant datasets include the Multimodal Dyadic Behavior (MMDB)
dataset, which focuses on analyzing dyadic social interactions between
adults and children in a developmental context. This dataset was
collected in a semi-structured format, where children interact with an
adult examiner in a series of pre-planned games. However, due to its
narrow focus on the analysis of social behaviors to diagnose
developmental disorders in children, we believe it is not general
enough. Another dataset is the Mimicry database, which focuses on
studying social interactions between humans with the aim of analyzing
mimicry in human-human interactions. This dataset was collected in an
unstructured format, where the two humans talk to each other about
different subjects.









There are a number of issues with the aforementioned datasets,
including: (a) unnatural, acted activities in constrained scenes;
(b) limited spatial and temporal coverage; (c) poor diversity of
activity classes; (d) lack of rich social interactions; (e) narrow focus
on a single behavior (e.g. mimicry); and (f) unstructured or
uncontrolled collection setup. Hence, we propose our new Tower Game
Dataset to address the above issues.

Tower Game Dataset: The tower game is a simple tower-building game often
used in social psychology to elicit different kinds of interactive
behaviors from the participants. It is typically played between two
people working with a small fixed number of simple toy blocks that can
be stacked to form various kinds of towers. We chose these tower games
because they force the players to engage and communicate with each other
in order to achieve the objectives of the game, thereby evoking
behaviors such as joint attention and entrainment. The game, due to its
simplicity, allows for total control over the variables of an
interaction. Due to the small number of blocks involved, the number of
potential moves (actions) is limited. Also, since the game involves
interacting with physical objects, joint attention is mediated through
concrete objects. Furthermore, only two players are involved, ensuring
that we stay in the realm of dyadic interactions.

There are many different variants of the game. We settled on two
variants designed to elicit maximum communication between the players,
namely, (i) the architect-builder variant and (ii) the
distinct-objective variant. Furthermore, in order to maximize the amount
of non-verbal communication, we prohibited the participants from
verbally communicating with each other.

The architect-builder variant involves one participant playing the role
of the architect, who decides the kind of tower to build and how to
build it. The second participant is the builder, who has control of all
the building blocks and is the only one actually manipulating the
blocks. The goal here is for the architect to communicate to the builder
how to build the tower so that the builder can build the desired tower.

The distinct-objective variant is slightly more complicated and is
designed to elicit more interaction between the players. In this
variant, each player is given half of the building blocks required to
build the tower. Each player is also given a particular rule,
restricting the configuration of the tower being built, that they are
required to enforce. An example rule could be that no two blocks of the
same color may be placed such that they are touching each other. To make
the play interesting, each player only knows their own rule and is not
aware of the rule given to the other player. The rules are selected at
random from a small rule book. While certain combinations of rules may
result in some conflict between the objectives of the two players, this
is typically not the case. However, since each player needs to adhere to
their rule, they will need to correct an action taken by the other
player if it conflicts with their rule. In the process, each player also
tries to figure out the rule assigned to the other player so that the
process of building the tower is more efficient. Also, when the subjects
played multiple sessions of this game, the pieces used were changed and
the area of the table upon which they could place blocks was reduced in
size.


Capture Setup: Our sensors include a pair of Kinect cameras that record
color video and depth video and track the skeletons of the players, and
a pair of GoPro cameras, one mounted on the chest of each player
(Fig. 1a). External lapel microphones were attached to the GoPro
cameras. However, the audio captured from them was used only for data
synchronization purposes. Since the players were not allowed to verbally
communicate with each other, very little speech (or paralinguistic) data
exists.

In order to ensure optimal data capture from the Kinect cameras (i.e.
minimal occlusions and optimal skeleton tracking), they were mounted on
tripods facing one another, slightly to the right and back of each of
the participants and slightly elevated, ensuring that each camera got an
unobstructed view of the other participant. The overhead layout is shown
in Fig. 1b. These videos are of VGA resolution (640x480) and were
captured at 30 Hz. The GoPro cameras were set to capture at full HD
(1920x1080) resolution and at the widest angle available. They were
placed on the harnesses rotated 90 degrees so as to capture the face of
the other player as well as the blocks on the table (Fig. 1c).

In each session, the subjects play the game by standing at either end of
a small rectangular table, as shown in Fig. 1c. The person supervising
the data collection enters player information and other metadata about
the game session into a form and then starts recording. The supervisor
then instructs the players to begin their game session. The players
first manually activate the GoPro cameras to start recording and then
clap their hands before starting their session. These claps were used to
automatically synchronize the GoPro videos with the Kinect videos. The
final dataset consists of the following data types for each game
session:

1. Two Kinect videos (RGB) 

2. Two depth videos (depth encoded in RGB) 

3. Two GoPro videos (distortion corrected) 

4. Intrinsic and extrinsic calibrations for the two Kinect 
cameras 

5. Intrinsic calibrations and video frame aligned sequences 
of camera poses for the GoPro cameras 

6. Kinect tracked skeletons for the two participants 

7. Face and head pose tracking for the two individuals
from the GoPro cameras when visible

8. Eye gaze information (3D vectors) for the two
participants whenever available

9. Object positions (2D bounding boxes, not 3D positions)
and tracks for all the blocks within each gaming
session.
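The text above states that hand claps were used to synchronize the GoPro and Kinect recordings but does not say how the claps were matched. One common way to do this, shown here purely as an assumed sketch and not as the authors' procedure, is to cross-correlate the audio energy envelopes of the two recordings and pick the lag with the highest score. The function name `best_lag` and its inputs are hypothetical.

```python
def best_lag(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best aligns with `ref`.

    A positive lag means the event appears `lag` samples later in `other`
    than in `ref`. Both inputs are sequences of non-negative energy values.
    """
    def score(lag):
        # Unnormalized cross-correlation restricted to overlapping samples.
        return sum(ref[i] * other[i + lag]
                   for i in range(len(ref))
                   if 0 <= i + lag < len(other))

    return max(range(-max_lag, max_lag + 1), key=score)


if __name__ == "__main__":
    # Toy envelopes: a single clap at index 2 in one stream, index 4 in the other.
    kinect_env = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    gopro_env = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
    print(best_lag(kinect_env, gopro_env, max_lag=3))
```

A sharp transient such as a clap produces a single dominant peak in the cross-correlation, which is why it is a convenient synchronization cue when the recordings contain little other audio.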

Data Annotation: Since our focus is on joint attention and entrainment,
we annotated 112 videos, which were divided into 1213 10-second
segments, indicating the presence or absence of these two behaviors in
each segment. To annotate the videos, we developed an innovative
annotation schema drawn from concepts in the social psychology
literature. The annotation schema is a series of questions that serve as
a guideline to assist the annotators. It associates high-level social
interaction predicates with more objectively perceptible measures. For
example, joint attention is the shared focus of two individuals on a
common subject, and it involves eye gaze (on a person and on an object)
and body language. Similarly, entrainment is the alignment in the
behavior of two individuals, and it involves simultaneous movement,
tempo similarity, and coordination. Each measure was rated as low,
medium, or high for the entire 10-second segment. We hired six
undergraduate sociology and psychology students to annotate the videos.
The students were given a general introduction to the survey instrument
and were then asked to code representative samples of the videos. The
videos were annotated after ensuring that all the students as a group
were annotating the sample videos accurately and reliably.

The dataset will be released with the acceptance of this 
paper. We will also publish a fully detailed description of 
the collection, capture, and annotation. 

4.2 Quantitative Results 

In this section we describe the set of experiments we conducted to
evaluate our proposed model.

Implementation Details: For our experiments, we relied 
only on the skeleton features. We use the 11 joints from the 
upper body of the two players since the tower game almost 
entirely involves only upper body actions. 

Using the 11 joints, we extracted a set of first-order static and
dynamic handcrafted skeleton features. The static features are computed
per frame and consist of relationships between all pairs of joints of a
single actor, as well as relationships between all pairs of joints
across the two actors. The dynamic features are extracted per window (a
set of 300 frames). In each window, we compute first- and second-order
dynamics (velocities and accelerations) of each joint, as well as
relative velocities and accelerations of pairs of joints per actor and
across actors. The combined static and dynamic features have 257,400
dimensions. To reduce their dimensionality we use Principal Component
Analysis (PCA) (100 dimensions) and Bag-of-Words (BoW) (100 and 300
dimensions). We also extracted deep learning features using RBMs and
CRBMs (50 dimensions).
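To make the handcrafted features more concrete, the following sketch computes per-frame pairwise joint relationships and per-joint velocities and accelerations. It is an illustration under stated assumptions: the paper does not specify which pairwise "relationship" is used, so Euclidean distance is assumed here, finite differences are assumed for the dynamics, the 30 Hz frame rate is taken from the Kinect capture rate stated earlier, and all function names are hypothetical.

```python
import math
from itertools import combinations, product


def static_features(joints_a, joints_b):
    """Per-frame pairwise Euclidean distances within each actor and across actors.

    joints_a and joints_b are lists of (x, y, z) joint positions for one frame.
    """
    feats = []
    for joints in (joints_a, joints_b):
        feats.extend(math.dist(p, q) for p, q in combinations(joints, 2))
    feats.extend(math.dist(p, q) for p, q in product(joints_a, joints_b))
    return feats


def finite_differences(track, fps=30.0):
    """Velocities and accelerations of one joint over a window of frames."""
    vel = [tuple((c1 - c0) * fps for c0, c1 in zip(p0, p1))
           for p0, p1 in zip(track, track[1:])]
    acc = [tuple((c1 - c0) * fps for c0, c1 in zip(v0, v1))
           for v0, v1 in zip(vel, vel[1:])]
    return vel, acc
```

With 11 upper-body joints per player, this particular choice yields 55 within-actor pairs per player plus 121 cross-actor pairs, that is, 231 static values per frame; concatenating such values over a window is what drives the raw dimensionality up and motivates the PCA and BoW reduction.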

For the DRBM and DCRBM we used the raw joint locations
normalized with respect to a selected origin point. We used
the same dimensionality for both models, D(v) = 66 and
D(h) = 50. For the DCRBM we empirically evaluated history
windows of different sizes and found that a window of size
n = 15 works best.
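
To make the history window concrete, the sketch below pairs
each frame with its n preceding frames, which is the kind of
input a conditional RBM conditions on. The helper names and
the list-based frame representation are assumptions for
illustration. The origin is passed in explicitly because the
text does not say which point was selected.

```python
def normalize_frame(joints, origin):
    """Express each (x, y, z) joint relative to a chosen origin point."""
    return [tuple(c - o for c, o in zip(joint, origin)) for joint in joints]

def history_windows(frames, n=15):
    """Yield (history, current) pairs: the n frames before t, and frame t.

    Frames with fewer than n predecessors are skipped, so a sequence of
    length T produces T - n pairs.
    """
    for t in range(n, len(frames)):
        yield frames[t - n:t], frames[t]
```

A visible layer of size 66 would be consistent with 11 joints
x 3 coordinates x 2 players per frame, although the text does
not spell this out.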

Results: For the purpose of this paper we focused on
three ESIPs: Coordination, Simultaneous Movement, and
Tempo Similarity. As a baseline we used a multi-class
Support Vector Machine (SVM) with the different types of
features defined above to classify each ESIP.

We divided our evaluation into two tasks. The first task
is the Classification Task. We use the raw features of the
two players, and our goal is to predict the level (strength)
of the three ESIPs. Each ESIP can be low, medium, or high,
hence random classification accuracy is 33%. The data is
split into two sets: a training set consisting of 70% of the
instances and a test set consisting of the remaining 30%. We
performed 5-fold cross-validation to reduce bias in the
results. Figure 3 shows our average classification accuracy
on the Tower Game dataset using different feature and
baseline combinations, as well as the results from our DCRBM
model. The evaluation is done with respect to the six
annotators {A1, A2, ..., A6} as well as the mean annotation.
We can see that the DCRBM model outperforms all the


Classifier/Annotator              A1     A2     A3     A4     A5     A6    All
SVM + Raw Skeleton             45.20  37.45  32.61  38.39  37.06  50.47  39.52
SVM + PCA (100D)               42.83  21.26  37.36  39.35  31.76  62.73  47.84
SVM + BoW (100D)               33.17  36.55  35.46  36.90  39.25  50.02  44.27
SVM + BoW (300D)               38.48  33.46  40.79  41.60  41.23  50.08  42.84
SVM + RBM (v=66, h=50)         43.52  31.06  42.17  38.73  41.90  56.36  43.17
D-RBM (v=66, h=50)             49.17  34.83  44.02  40.17  45.47  66.55  44.02
SVM + CRBM (v=66, h=50, n=15)  48.54  33.37  43.46  34.08  44.39  66.04  42.45
D-CRBM (v=66, h=50, n=15)      55.15  40.79  48.61  45.39  50.02  70.48  49.17

(a) Simultaneous Movement (classification accuracy, %)


Classifier/Annotator              A1     A2     A3     A4     A5     A6    All
SVM + Raw Skeleton             59.00  54.56  46.91  73.31  79.21  50.57  52.21
SVM + PCA (100D)               58.75  48.01  52.25  79.68  83.03  44.61  58.16
SVM + BoW (100D)               52.15  63.57  40.06  58.03  73.25  47.38  55.76
SVM + BoW (300D)               60.30  64.94  35.09  59.53  74.88  51.32  47.54
SVM + RBM (v=66, h=50)         61.57  50.01  41.46  68.94  82.15  50.38  58.94
D-RBM (v=66, h=50)             71.57  60.15  48.93  79.38  87.57  53.03  59.75
SVM + CRBM (v=66, h=50, n=15)  70.52  47.35  43.09  78.15  87.86  53.74  59.38
D-CRBM (v=66, h=50, n=15)      85.71  68.22  57.15  82.53  89.30  55.86  62.08

(b) Coordination (classification accuracy, %)


Classifier/Annotator              A1     A2     A3     A4     A5     A6    All
SVM + Raw Skeleton             68.65  44.55  53.50  82.35  72.00  71.65  59.29
SVM + PCA (100D)               64.86  50.48  51.77  86.24  82.21  81.32  72.81
SVM + BoW (100D)               65.10  45.60  40.49  58.03  73.25  47.38  65.64
SVM + BoW (300D)               57.04  41.29  46.06  77.21  58.05  69.64  54.36
SVM + RBM (v=66, h=50)         66.38  50.77  44.21  82.65  75.32  73.00  68.04
D-RBM (v=66, h=50)             69.55  51.24  51.77  87.03  77.86  80.24  71.32
SVM + CRBM (v=66, h=50, n=15)  68.06  51.03  45.65  86.10  77.01  77.38  71.71
D-CRBM (v=66, h=50, n=15)      71.86  55.77  54.03  88.76  85.49  83.01  76.52

(c) Tempo Similarity (classification accuracy, %)

Figure 3: Average classification results on ESIPs. It is clear 
that the DCRBM outperforms all other baselines on the 
three ESIPs. 
(a) Generating partial visible layer data

Figure 4: Average generation error for the partial visible
layer by varying the generated window size for the three
different ESIPs.

(b) Average generation error (missing full visible)

Figure 5: Average generation error for the full visible layer
by varying the generated window size for the three different
ESIPs.

other models for each of the three measures across all
annotators, thereby demonstrating its effectiveness in
detecting these entrainment measures. Furthermore, the DCRBM
model outperforms the PCA- and BoW-based features, which are
derived from the high-dimensional handcrafted features,
demonstrating its ability to learn a rich representation
starting from the raw skeleton features. Finally, the
performance of the DCRBM model indicates that the joint
learning and inference of DCRBMs is superior to the staged
approach of the SVM + CRBM model.
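
The gap between joint and staged training can be read
directly off the "All" (mean annotation) column of Figure 3.
The short sketch below recomputes it. The numbers are copied
from the tables, while the dictionary layout and function
name are only illustrative choices.

```python
# "All" (mean annotation) column of Figure 3, copied from the tables.
ALL_COLUMN = {
    "Simultaneous Movement": {"SVM + CRBM": 42.45, "D-CRBM": 49.17},
    "Coordination": {"SVM + CRBM": 59.38, "D-CRBM": 62.08},
    "Tempo Similarity": {"SVM + CRBM": 71.71, "D-CRBM": 76.52},
}

def joint_vs_staged_gain(table):
    """Return D-CRBM minus SVM + CRBM accuracy (percentage points) per measure."""
    return {
        measure: round(row["D-CRBM"] - row["SVM + CRBM"], 2)
        for measure, row in table.items()
    }

gains = joint_vs_staged_gain(ALL_COLUMN)
```

On the mean annotation, the joint model is ahead of the staged
one by 6.72, 2.70, and 4.81 percentage points for Simultaneous
Movement, Coordination, and Tempo Similarity, respectively.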


The second task is the Generation Task, where we are
given the class label and our goal is to generate the data
(i.e., the raw features) for that label. This task allows us
to visualize what the classifier has learned. For generation,
we initialize the model using 15 frames for each person and
then generate sequences of lengths varying from 16 to 300
frames. We measure the mean error between the ground-truth
data and the generated data for each class label over 50
video instances, using the normalized mean squared error
metric defined in (16).


\[
\text{Generation Error} =
  \frac{\lVert V_{\text{Generated}} - V_{\text{Groundtruth}} \rVert^{2}}
       {\lVert V_{\text{Groundtruth}} \rVert^{2}}
\tag{16}
\]
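
A minimal sketch of the metric in (16) follows, under the
reading that it is the squared L2 norm of the difference
divided by the squared L2 norm of the ground truth. The
function name and the flat-list input format are assumptions
made for illustration.

```python
def generation_error(generated, groundtruth):
    """Normalized squared error between two equal-length sequences of floats.

    Computes ||generated - groundtruth||^2 / ||groundtruth||^2, which is one
    reading of Eq. (16); the exact normalization is an assumption.
    """
    if len(generated) != len(groundtruth):
        raise ValueError("sequences must have the same length")
    num = sum((g - t) ** 2 for g, t in zip(generated, groundtruth))
    den = sum(t ** 2 for t in groundtruth)
    if den == 0:
        raise ValueError("ground truth has zero norm")
    return num / den
```

Under this reading, a perfect reconstruction scores 0 and an
all-zero output scores 1, so the reported errors below 0.1
would correspond to a squared deviation of less than 10% of
the squared signal norm.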


Generation is done in two different settings. In the first
setting, given partial visible player data (one player’s
features) as well as the class label, the goal is to generate
the other player’s data. Figure 4 shows the average generation
error of our DCRBM model when generating the partial visible
layer. In the second setting, given only the class label, the
goal is to generate the entire visible layer data (i.e., the
raw features for both players). Figure 5 shows the average
generation error of our DCRBM model when generating the full
visible layer. We can see that the generation error is
relatively low (< 0.1) in all cases (except for Tempo
Similarity^ when generating the entire visible layer data),
demonstrating the effectiveness of the DCRBM model for
generating data. Also, the error is similar across different
levels (strengths) of each measure, indicating that the model
is relatively stable. Finally, the error increases with the
length of the generated sequence, which is expected, as the
possibility of variation in the ground-truth sequences
increases with length.
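
The two generation settings can be pictured as a rollout over
a sliding history window. The sketch below shows only that
control flow: `predict_next` is a hypothetical stand-in for
sampling from the trained, label-conditioned model, and
`clamp` is a hypothetical hook that overwrites the observed
player's features in the partial setting. Neither is taken
from the paper.

```python
def rollout(seed_frames, predict_next, total_length, n=15, clamp=None):
    """Generate frames one at a time from the last n frames.

    seed_frames: initial frames (at least n), each a list of floats.
    predict_next: hypothetical stand-in for the trained model; maps a
        history of n frames to the next frame.
    clamp: optional function t -> {feature_index: value} giving observed
        features (e.g. one player's data) that overwrite the prediction
        at frame t. Leave as None to generate the full visible layer.
    """
    if len(seed_frames) < n:
        raise ValueError("need at least n seed frames")
    frames = [list(f) for f in seed_frames]
    while len(frames) < total_length:
        nxt = list(predict_next(frames[-n:]))
        if clamp is not None:
            for i, value in clamp(len(frames)).items():
                nxt[i] = value
        frames.append(nxt)
    return frames
```

Because each generated frame is fed back in as history, any
deviation from the ground truth can carry forward, which is
one way to see why the error would grow with sequence length.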

Therefore, the classification task shows that DCRBMs can
effectively detect the constituents of entrainment (an ESIP).
Similarly, the generation task shows that DCRBMs can
effectively generate raw skeleton data of the actors while
modeling the different strengths of each constituent
(measure).


^Tempo Similarity measures the similarity in the rate of
motion of the two players. When data from both players is
missing, generating their raw features based only on whether
their rates of motion are similar is extremely
under-constrained.

5. CONCLUSIONS AND FUTURE WORK

We presented a novel approach to computational modeling
of social interactions based on modeling of essential social
interaction predicates (ESIPs) such as joint attention and
entrainment. Our data collection was guided by social
psychological theory and methodology. We introduced a new
“Tower Game” dataset consisting of audio-visual capture of
dyadic interactions labeled with the ESIPs, which should
spark new research in computational social interaction
modeling. We proposed a novel joint Discriminative
Conditional Restricted Boltzmann Machine (DCRBM) model that
enabled us to uncover actionable constituents of the ESIPs in
two steps. First, we trained the DCRBM model on the labeled
data; second, we used it to generate lower-level data
corresponding to the ESIPs with high accuracy.

Such purely computational decomposition of ESIPs into
actionable behavioral constituents is unprecedented and
powerful, and offers rich possibilities for further research.
First, we can substantially advance the understanding of
ESIPs by uncovering mid-level predicates using the hidden
layers of the DCRBM, thus going beyond the current low-level
feature generation to a multi-level understanding of the
semantics of ESIPs. Second, we would like to extend our
framework to multimodal streams that also include gaze,
facial behaviors, head pose, and audio, so as to get a full
understanding of the actionable behaviors that make up the
ESIPs. For instance, we may find out that coordinating gaze
and gestural behavior is the most effective in establishing
rapport, or perhaps not. Third, such a comprehensive
multimodal and semantic model would capture the overall
“rules of engagement” in a social interaction. Such a model
would therefore lend itself to monitoring and training
applications, such as automatic assessment of the efficacy of
an interaction in terms of the establishment of rapport and
engagement. It would also support the generation of
“interaction-realistic” avatar behaviors in a virtual reality
environment, which convey realism through interaction
dynamics rather than through photo or audio realism and thus
achieve immersion and engagement, as well as more efficacious
human-robot interaction. We have thus laid the foundation of
a computational approach that enables us to move from
“folklore”-based methods of establishing ESIPs to methods
that are systematically arrived at through computational
analysis of data from scientific observations.